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the current load on the computer and the load on at least some of the other computers in the network by 
causing the computer periodically to distribute the information to the computers to which it is logically linked, to 
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so that the information can be used to determine a computer in the network that can accept extra load. A 
sender -initiated embodiment of the invention includes the further step of, when the computer is overloaded, 
using the information to determine a computer that can accept extra load and transferring at least one task to 
that computer. The load balancing technique is scalable, fault tolerant, flexible and supports clustering thus 
making it suitable for use in networks having very large numbers of computers. 
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The invention relates to nnethods of operating computers in networks and to computers capable of 
operating In networks. 

Local area networks of computers are an Indispensable resource in many organisations. One problem in 
a large network of computers is that it is sometimes difficult to make efficient use of all the computers in the 
6 network. Load sharing or balancing techniques increase system throughput by attempting to keep all 
computers busy. This is done by off - loading processes from overloaded computers to idle ones thereby 
equalising load on all machines and minimising overall response time. 

Load balancing methods can be classified according to the method used to achieve balancing. They 
. . can-be 'static* or 'dynamic' and central' or 'distributed'. 

10 In static load balancing, a fixed policy - deterministic or probabilistic - is followed independent of the 
current system state. Static load balancing is simple to implement and easy to analyse with queuing 
models. However, its potential benefit is limited since it does not take into account changes in the global 
system state. For example, a computer may migrate tasks to a computer which is already overloaded. 

In dynamic schemes the system notes changes in its global status and decides whether to migrate 

/5 tasks based on the current state of the computers. Dynamic policies are inherently more complicated than 
static policies since they require each computer to have knowledge of the state of other computers in the 
system. This information must be continuously updated. 

In a central scheme one computer contains the global state information and makes all the decisions. In 
a distributed scheme no one computer contains global information. Each computer makes migration 

20 decisions based on the state information it has. 

Load -balancing methods can be further classified as 'sender - initiated' and 'receiver -initiated'. In 
sender -initiated policies an overloaded computer seeks a lightly loaded computer. In receiver -initiated 
policies lightly loaded computers advertise their capability to receive tasks. It has been shown that if costs 
of task -transfer are comparable under both policies then sender - initiated strategies outperform receiver - 

25 initiated strategies for light to moderate system loads, while receiver - Initiated schemes are preferable at 
high system loads. 

Conventionally, a dynamic load balancing mechanism is composed of the following three phases: 

1. Measuring the load of the local machine. 

2. Exchanging local load information with other machines in the network. 
30 3. Transferring a process to a selected machine. 

Local load can be computed as a function of the number of jobs on the run queue, memory usage, 
paging rate, file use and I/O rate, or other resource usage. The length of the run queue is a generally 
accepted load metric. 

The receipt of load information from other nodes gives a snapshot of the system load. Having this 
35 information, each computer executes a transfer policy to decide whether to run a task locally or to transfer it 
to another node. If the decision is to transfer a task then a location policy Is executed to determine the node 
the task should be transferred to. 

If a load balancing method is to be effective in networks having large numbers of computers it must be 
scalable, in the sense that network traffic increases linearly with the size of the network, flexible so that 
40 computers can be easily added to or removed from the network, robust in the event of failure of one or 
more of the computers and must support to some degree clustering of the computers. 

Centralised methods, though attractive due to their simplicity are not fault - tolerant The central 
computer can detect dead nodes and It takes one or two messages to re - establish connection. However, If 
the central node fails, the whole scheme falls apart. Load balancing configuration can be re-established by 
45 maintaining a prioritised list of alternative administrators in each node or by implementing an election 
method to elect a new centralised node. When a central node fails, other nodes switch to the new central 
node. This enhancement, however, increases the management complexity and cost of the method. 

In addition, management overhead for a centralised method is unacceptably large. As the number of 
nodes increases, the time spent by the central node in handling load information increases, and there must 
50 come a point at which it will overload. 

Several prior art load balancing methods are both dynamic and distributed. 

For example, in the method described in Barak A. and Shiloh A. SOFTWARE - PRACTICE AND 
EXPERIENCE, 15(9):901 - 913, September 1985 [R1] each computer maintains a fixed size load vector that 
is periodically distributed between computers. Each computer executes the following method. First, the 
55 computer updates its own load value. Then, it chooses a computer at random and sends the first half of its 
load vector to the chosen computer. When receiving a load vector, each computer merges the information 
with its local load vector according to a predefined rule. 
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A problem with this method is that the size of the load vector has to be chosen carefully. Finding the 
optimal value for the load vector is difficult and it must be tuned to a given system. A large vector contains 
load information for many nodes. This, many nodes will know of a lightly loaded node, increasing the 
chance that it will quickly receive many migrating processes and will overload. On the other hand, the size 
5 of the load vector should not be so small that load values do not propagate through the network in a 
reasonable time. For large number n of nodes in the network, the expected time to propagate load 
information through the network is 0(log n). 

The method described in R1 can handle any number of computers after, tuning the length of the load 
vector. Therefore, the method is scalable. However, it is flexible only up to a point. Adding a small number 

70 of nodes to the network does not require any change. Adding a large number of nodes requires changing 
the size of the load vector in all computers, thus increasing the administrative overhead. The method is 
fault - tolerant and continues to work in spite of single failures, however there is no built-in mechanism for 
detecting dead nodes and updating information on them in the load vectors. This information about failed 
nodes is not propagated and other nodes may continue to migrate processes to them, with the result of 

15 decreased response time. 

The number of communications per unit time is 0(n). However, the method does not support clustering. 
When a node wants to off-load processes to another node it chooses a candidate from its load vector. 
Since the information on nodes in the load vector of node p Is updated from random nodes, the set of 
candidates for off-loading Is a random subset of n and not a controlled set for p. 

20 In Lin F. C. H. and Keller R. M. PROCEEDINGS OF THE 6TH INTERNATIONAL CONFERENCE ON 
DISTRIBUTED COMPUTING SYSTEMS, 329-336, May 1986 [R2], a distributed and synchronous load 
balancing method using a gradient model is disclosed. Computers are logically arranged in a grid and the 
load on the computers Is represented as a surface. Each computer interacts solely with its immediate 
neighbours on the grid. A computer with high load is a *hiir and an idle computer is a Valley'. A neutral 

25 computer is one which is neither overloaded nor idle. Load balancing is a form of relaxation of the surface. 
Tasks migrate from hills towards valleys and the surface is flattened. Each computer computes its distance 
to the nearest idle computer and uses that distance for task migration. 

An overloaded node transfers a task to its immediate neighbour in the direction of an idle node. A task 
continues to move until it reaches an idle node. If a former idle node is overloaded when a task arrives, the 

30 task moves to another idle node. 

This method, though scalable, is only partially flexible. It is easy to add or remove nodes at the edge of 
the grid, but it is difficult to do so in the middle since the grid is fixed. To do this, reconfiguration is 
required, which increases the overhead inherent in the method. Detection of dead nodes is not easy. Since 
only changes of the state of a node are sent to its neighbours, a node's failure remains undetected until a 

35 job is transferred to it. Late detection delays migrating processes and, therefore, increases overall response 
time. In this method clustering is not supported. Each node transfers processes to one of its immediate 
neighbours which may transfer it in a different direction. The overloaded node has no control as to where 
the off-loaded processes will eventually execute. Management overhead to start the grid is low and the 
number of messages is 0(n). Administrative overhead for reconfiguration is high since it requires changes in 

40 all neighbouring nodes for a failed node. In this method node overhead is high and network traffic increases 
because tasks move in hops rather than directly to an idle node and it takes long time to migrate a process. 

The buddy set algorithm proposed in Shin K. G. and Chang Y. C. IEEE TRANSACTIONS ON 
COMPUTERS, 38(8), August 1989 [R3], aims for very fast load sharing. For each node two lists are defined: 
Its buddy set, the set of its neighbours, and its preferred list, an ordered list of nodes in its buddy set to 

45 which tasks are transferred. Each node can be in one of the following states: under-loaded, medium - 
loaded and overloaded. Each node contains the status of all nodes in its buddy set. Whenever the status of 
a node changes, it broadcasts its new state to all nodes in its buddy set. The internal order of preferred lists 
varies per node in order to minimise the probability that two nodes in a buddy set will transfer a task to the 
same node. When a node is overloaded it scans its preferred list for an under - loaded node and transfers a 

50 task to that node. Overloaded nodes drop out of preferred lists. Buddy sets and their preferred lists overlap, 
so processes migrate around the network. 

Again, this method, though scalable in that the number of messages increases linearly with the number 
of computers in the network n, is not flexible. Adding or removing nodes from the network requires 
recomputation of the preferred lists in all nodes of the related buddy set If a node is added to more than 

55 one buddy set the recomputation must be done for each node in each buddy set. Detection of dead nodes 
is difficult for the same reason as in the gradient model described in R2. Clustering is supported but 
reconfiguration is expensive since it requires recomputation of new buddy sets and preferred lists, as for 
adding and removing nodes. The administrative overhead inherent in the method is therefore high. 
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According to a first aspect of the present invention there is provided a method of operating a computer 
in a network of computers, the method comprising: generating logical links between the computer and other 
connputers in the network so that a tree structure is formed, the computer being logically linked to one 
computer higher up the tree and a number of computers lower down the tree; and maintaining in the 
5 computer stored information regarding the current load on the computer and the load on at least some of 
the other computers in the network by causing the computer periodically to distribute the information to the 
computers to which It is linked, to receive from said computers similar such Information and to update its 
own information in accordance therewith, so that the information can be used to determine computers in the 
- network that can accept extra load. 
10 The invention is applicable to both sender -initiated and receiver -initiated load balancing. A embodi- 
ment of the invention using a sender -initiated load balancing technique includes the further step of, when 
the computer is overloaded, using the information to determine a computer that can accept extra toad and 
transferring at least one task to that computer. 

There is further provided a method of operating a network of computers by operating each computer 
75 using the above method. 

The generation of the logical links can be achieved by assigning a rank to each computer, no two 
computers being assigned the same rank, each computer being logically linked to one computer of lower 
rank and a number of computers of higher rank to form the tree structure. 

The tree structure can be maintained if a computer fails, or is otherwise inoperative, by generating new 
20 logical links between each of the computers lower down the tree to which the failed computer was linked 
and other computers, which have capacity for accepting new downward links. 

An improved load balancing technique is employed which has the property of scalability by limiting the 
communication load on each computer and the knowledge of the network topology which is required to be 
stored in each computer. The actions taken by each computer are thus independent of the total number of 
25 computers in the network. 

Since the tree structure is dynamically built and maintained, the method is also flexible. Changing a tree 
configuration requires simply changing the ranking of the reconfigured computers. The new ranking files 
have to be consistent with existing ranking files to avoid cycles. A computer is reconfigured by logically 
disconnecting It so that it appears inoperative to Its neighbours, replacing its configuration file, and letting it 
30 reconnect to the tree. A new configuration is achieved without disturbing other nodes on the tree and 
without disrupting local services. 

Detection and reconnectlon of failed nodes is also provided for. If a computer higher up the tree does 
not respond within a certain time each higher ranked computer previously connected to it tries to relink 
itself elsewhere in the tree. If it temporarily fails to do this, It acts as root of a disconnected sub - tree, and 
35 load balancing occurs within the sub -tree. Eventually, the flags marking nodes as already tried or dead will 
be reset, and the computer, which periodically tries to attach itself to the tree, will reconnect to the tree. 

Advantageously, the periodic distribution of load information across the links of the tree is used by each 
computer to determine whether or not the computers to which it is linked are operational, each computer 
seeking, If the computer to which it is linked higher in the tree Is not operational, to generate a new link with 
40 a computer having lower rank and having the capacity for new downward links and being marked. If one of 
the computers lower in the tree to which it is linked is not operational, as having capacity to accept new 
downward links. 

In a preferred form of the invention, the information stored In each computer contains a number of 
entries, each entry containing information regarding the load on a particular one of the computers in the 
45 network, i, the number of links in the tree separating it from the computer in which the information Is stored, 
and the name (ie, rank) of the computer, logically linked to the computer in which the information is stored, 
from which the entry was last received; and wherein when each computer receives the similar information 
from a computer to which it is logically linked the following steps are performed. 

a) the number of links separation value in each entry of the received information is incremented by one; 
50 b) entries In the received information which originated from the receiving computer are deleted; 

c) entries In the information already stored in the receiving computer which were received from the 
sending computer are deleted; 

d) the received Information is merged with the Information already stored in the receiving computer; 

e) the merged information is sorted In ascending order of load, entries with equal load being sorted in 
55 ascending order of number of links separation from the receiving computer. 

Before each distribution of information between computers, entries in the information relating to 
computers which have the same load and number of links separation from the computer can be randomly 
permuted, so that the probability that the same entry will appear first in the information stored in different 
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computers is reduced. Also, if an entry corresponding to the computer appears first in its own sorted 
information a counter can be attached to the entry and initialised to the negative value of the spare load 
capacity of the computer, the counter being incremented, if that entry is still first, whenever the information 
is distributed, and if the counter becomes positive that entry is removed from the information. The 
5 probability that the same entry will appear first in the information stored in different computers is thereby 
reduced. 

In an advantageous form of the invention the period of the periodic distribution of the information to a 
computer higher up the tree Is related to the rate at which new tasks are created in the computer and/or the 
rate at which new tasks are created in the computers to which it is logically linked in the tree structure. This 

70 has the advantage that if one or more nodes suddenly become highly loaded this information will spread 
more quickly throughout the network. 

A further advantage of the method provided in the present invention is that the case where the overall 
load in the network increases steadily is handled gracefully. In such a case, when local load vectors are 
searched for a target computer, fewer candidates will be found and the number of migrations will decrease. 

15 When overall load decreases, more candidates will be found in the local load vectors and migrations will 
resume. In other words the network performance degrades gracefully during network load saturation and 
does not crash, rather normal operation is resumed when overall load returns to normal. 

Viewed from another aspect the invention enables to be provided a computer capable of operating in a 
network of similar such computers, the computer comprising: means identifying computers in the network to 

20 which the computer is logically linked in a tree structure; means for storing information regarding the load 
on the computer and at least some of the other computers in the network; means for sending said 
information to the computers to which the computer is logically linked in the tree structure; means for 
receiving similar information from said computers to which the computer is logically linked and updating the 
stored information in accordance therewith, means for selecting one of the other computers in the network 

25 having spare load capacity using said information and transferring tasks to the selected computer. 

The computer preferably also includes means for accessing a stored list of all the computers in the 
network; means for selecting a computer from this list as a candidate neighbouring computer for linking in 
the tree structure; means for sending a message to the selected computer indicating a link request; means 
for establishing a logical link with the selected computer by updating the identifying means if a positive 

30 response from the selected computer is received; means for receiving messages from other computers 
indicating that a link is requested; means for determining, on receipt of such a link request message, 
whether or not there is capacity for establishing a downward link and sending a positive response to the 
sender of the link request message if there is such capacity and updating the identifying means 
accordingly. 

35 There is also provided a network of such computers. 

An embodiment of the invention will now be described, by way of example only, with reference to the 
accompanying drawing, wherein: 

Figures la, lb, 1c and Id are flow diagrams showing the request and receive phases of the tree 
generation; 

40 Figures 2a, b and c show stages of the tree generation according to the present invention; 

Figures 3a, b and c illustrate an example of tree maintenance in the embodiment of the invention; 
Figures 4a, b and c illustrate the handling of clustering of nodes in the embodiment of the invention. 
This embodiment of the invention is composed of two main parts. The first part is tree building and 
maintenance, and the second part is the exchange and maintenance of load balancing infonmation. 

45 

TREE BUILDING AND MAINTENANCE 

Each computer in the network is assigned a unique rank which is used in building the tree. In the 
following the term "parent" will be used to refer to a computer of lower rank to which a computer can form 
50 an upward link and the term "child" will be used to refer to a computer of higher rank to which a computer 
can form a downward link. 

Each computer has stored in it a configuration file that contains 

(i) the maximum number of direct descendants the computer can accept. These are referred to as 
children slots. 

55 (ii) a list of computers and their respective ranks. 

(iii) an ordered list of "favoured parent" nodes. FP, Favoured parents have lower rank than the current 
computer. The FP list is used by the computer to choose a parent node. The FP list is optional. It can be 
empty in which case a parent is chosen at random. The FP list includes a subset of all computers of 
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lower rank. 

(iv) the time to wait for parent and children responses. 

(v) the tinne to wait to probe if a node is dead or alive. 

A list of candidate parents for a computer CR, which will be referred to as the candidates range of P, 
5 limits the range of possible parents. Initially CR of each computer contains all the computers with rank less 
than the rank of the computer. A node is chosen as a parent only if it is in CR and in FP. If FP is empty or 
exhausted then only membership of CR is checked. 

Each computer performs the following steps to build the tree. There is a request phase in which a 
computer chooses another computer and requests it to become a parent, and a receive phase in which the 
;o computer waits for a reply and responds to requests from other computers. 

Request Phase 

1. Each computer i picks another computer from its list of favoured parents FPj. If FPj is empty or all 
75 candidates in FPj failed, a computer with lesser rank is chosen at random. If the chosen computer j Is not 

in CR| then the next one in FP| is chosen or another is randomly picked. If all nodes were tried 
unsuccessfully then the computer enters the receive phase. 

2. i sends the chosen computer j a request_parent message and waits for messages from other 
computers. 

20 3. If the candidate parent does not respond within a certain period of time, it is marked dead and i 
reenters the request phase. 

Receive Phase 

25 1 . If a request parent message arrives from m then: 

(a) If i has a free child slot, it accepts m's request and sends back an ack message 

(b) If i has exhausted its allotted number of child slots then: 

1) i scans its children for a late child as defined below. If one is found then 
(a) the child is marked as dead and removed 

30 (b) this makes a child slot available 

(c) i sends m an ack message 
If a late child is not found then 

2) i finds the child k with highest rank, 

3) If rank(k) > rank(m) then: 

35 a) i accepts m as a child in place of k and sends m an ack message. 

b) i sends k a disengage message containing rank(m) as a parameter. 

4) If rank(k) < rank(m) then: 

a) i locates in its list of computers the computer n with the lowest rank satisfying rank(n) > 
rank(t). 

40 b) i sends m a noack message containing rank(n) as a parameter. 

2. If a disengage message with parameter r arrives then: 

a) i clears its parent field 

b) i removes the computer with ranks outside the range [rank(i) - 1 ,r] from CRi and starts looking for a 
new parent ie enters request phase. 

45 3. If ack arrives from candidate parent the computer records the candidate parent as a true parent. 
4. If a noack message with parameter r arrives then: 

(a) i clears its candidate parent field 

(b) the computer executes step 2(b) above. 

Rgure la is a flow diagram showing the request phase of the tree generation and Rgures lb, 1c and Id 
50 are flow diagrams showing the receive phase of the tree generation. 

As an example of tree generation, consider 5 computers with ranks 0 to 4 where each can accept two 
children. Due to random choice of parents the method can reach the stage shown in Figure 2a. 

Computer 1 can choose only computer 0 as a parent but computer 0 has already two children. 
Therefore, the method replaces child 3 of computer 0 with computer 1 and sends computer 3 a disengage 
55 message with parameter 1. Thus the situation pictured in Figure 2b is reached. 

Node 3 searches for a parent in the range 2 to 1 and picks, say 1 as parent. The resulting tree appears 
in Figure 2c. 

If all nodes are alive and there are no late children then the tree generation terminates with one tree. 
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The above tree generation may build unbalanced trees or trees with many nodes having only one child. 
However, it has been found that the load balancing method Is Insensitive to the topology of the tree. 

Tree maintenance is required to detect dead nodes, to add new nodes, or to reconnect rebooted nodes. 
It is necessary because nodes fall and nodes are sometimes shut down for maintenance. In this 
5 embodiment of the Invention such cases are handled in the following way. 

1. Each computer periodically sends an update_up message to its parent and waits for an 
update_down message. 

2. If no response arrives within a specified period of time the parent is marked dead and the computer 
-- - looks for a new parent. 

70 3. If an update_up message arrives from a child the node responds with an update_down message. 

4. If no update_up message arrives from a child within a specified period of time the child is marked 
late. 

5. If an update up message arrives from a late child then the late flag for that child is reset and the 

node responds with an update down message. 

75 6. When a computer is rebooted or added to the network it looks for a parent, ie it enters the request 
phase. 

7. Each computer flags computers already tried and those marked dead. Periodically these flags are 
cleared, and CR Is reset to contain all computers with lesser rank. Thus rebooted nodes can be probed 
again. 

20 As an example, consider a tree of 1 8 nodes labelled a to r as shown in Figure 3a. 

If node h fails, then after a predefined period of time, its parent c will mark it as late and ignore it since 

it did not receive an update up message. Nodes k and r will start looking for new parents since no 

update down message arrived. The network now comprises a number of separate trees as shown in Figure 

3b. 

25 k and r look for a new parent In their FP. If their FP is empty they choose a parent at random. The new 
tree may look like the one in Rgure 3c, with node k having node g as its new parent and node r having 
node c as its new parent. 

The tree generation together with the tree maintenance mechanism ensure that the computers In the 
network generally will be arranged in a single tree structure. To see that this is true In the case of a node 
30 failing consider an arbitrary tree in which node k fails or is otherwise inoperational. k's parent will mark It as 
late and k's children will mark it as dead. Assume m was a child of k. Therefore, m looks for a new parent. 
There are three possibilities: 

1 . m finds a node which can accept it as a child. Thus, a single tree is formed. 

2. m picks k's parent as a candidate and there are no free slots. In this case the slot of k marked as late 
35 is used and k is marked as dead. Again, a single tree is formed. 

3. m does not find a node which can accept it as child. 

This can happen only If the nodes which would have accepted it (there is at least one, Ie m-l) are 
marked as already tried or as dead. In this case the network temporarily comprises a number of sub- 
trees rather than a single tree. This state will end when all these flag's are reset (step 7 of the 
40 maintenance mechanism) and then m will try again and will find a parent. 
The above reasoning is applicable for all the children of k. 

EXCHANGE AND MAINTENANCE OF INFORMATION 

45 The number of jobs on the run queue is generally accepted as a suitable load metric. A node is 

considered overloaded if the length of the run queue exceeds a predefined threshold, overload mark. 

Since the quantity that is really of interest is how far the node is from its overload mark, a slightly different 
metric is used in this embodiment of the invention. In this embodiment of the invention load is defined as 
length_of_run - queue - overload_mark. For example, if overload_markA = 10 and overload^marke = 

50 8 and the length of run queue of nodes A and B are 5 and 4 respectively, then the loadA = -5 and loads 
= -4. Thus, the load of B is higher than that of A. 

In the following this metric will be used for load and whenever the term load is used, 
length_of__run_queue - overload_mark is meant. 

Each computer in the network stores a sorted vector which holds information on other computers in the 

55 network. Each entry in the vector contains: 

1 ) the rank of the computer 

2) the local load of the computer 

3) the distance in terms of the number of edges or links the load information has traversed 
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4) the last node that propagated this information (last node field) 

The length of the load vector, ie the number of entries it contains, may vary from node to node. 
Load Information is distributed over the network in the following way. 

1. Periodically each computer samples its load L, sorts its load vector using the new value of L, and 
5 sends the load vector to its parent in an update up message. The update up message is also used as 

the indication that the node Is operational for the purposes of tree maintenance. 

2. When an update up message is received by node m from node k, 

, a The distance of each entry in the received load vector is incremented by 1. 

b All entries in the received load vector which originated from receiving node m are deleted. 
10 c AH entries in the load vector which originated from the sending node k are deleted. 

d The distances of entries (computers) which appear in both the local vector and the received vector 
are compared: 

1. If the distance of local entry is greater or equal to the distance of the received entry then the 
local entry is deleted. 

15 2. If the distance of the received entry Is greater than the distance of the local entry then the 

received entry is deleted, 
e The last node field in the entries of the received vector is set to be the sending node k. 
f The received load vector is merged with the local load vector. 

3. A parent sends its new load vector to the child who sent the update up message. This message is an 

20 update down message which again is used as an indication that the parent is operational in the tree 

maintenance message. 

4. When an update down message arrives, the receiving child processes the received load vector in the 

same way as described in step 2. 

The update__down messages propagate information received by each node from the nodes to which 
25 they are linked in the tree structure. Thus children share information by passing it up to the parent who 
passes it down to the other children. By propagating load information up and down the tree, information on 
under -loaded nodes in another sub -tree can reach any node. Due to the fixed size of the load vector, 
each computer holds information on a sub -set of the computers that are least loaded. 

The load information in a load vector is only an approximation of the real load on any particular 
30 computer, since by the time it reaches another node the load of the originating computer may have 
changed. 

The 'update interval* is defined as the time interval between successive load distributions; ie - 

update up messages. This interval significantly influences the results of load balancing. Too large a value 

renders the load information of the current node in other nodes inaccurate. Migration decisions based on 

35 this information result in many rejections due to the inaccuracy of the information. Too small a value; ie a 
higher update rate, increases the overhead of the method and response time is degraded. The update rate 

(1 /update interval) should be proportional to the rate of generation of local processes and should change 

as this rate changes. In addition there is another factor which has to be considered. It has to be ensured 
that the update rate of the current node does not slow down the propagation of information from Its children 

40 to its parent and vice versa. This wills happen if the rate of the current node is low and that of its children or 
its parent is high. Therefore, the update rates of the immediate neighbours have also to be considered. 

The effective_rate of a node is defined as the result of a rate calculation function based on these two 
factors. 

In this embodiment of the invention the following function is used. Initially the effective rate of each 
45 node is set to the birth rate of local processes. For each node p the following quantities are computed: 

a) ratei = local_birth_ratep 

Local birth ratep is computed as an average over a sliding time window. 

b) ratea = max{effective raten}/a 

for all immediate neighbours n of p. or is an attenuation factor determined by computer simulation 
50 experiments as described below. 

The effective rate of p is given by: 

effective_ratep = max{ratei .ratea} 

55 In computer simulation experiments it was found that values of a in the range 1 A to 2.0 gave good 
results for networks with evenly distributed load and networks with regions of high load. 

When a computer is overloaded it searches its local load vector for the first entry indicating a computer 
that can accept extra load, and transfers one or more tasks directly to it. The load entry of the target 
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computer in the load vector is updated and the load vector is re - sorted. If the target computer cannot for 
whatever reasons accept extra load, the sender is notified. In this case the sender deletes that entry from 
the load vector, retrieves the next entry and use It as a new target. If no node can accept extra load then no 
migration is initiated. 

6 Thus, the location policy; ie, finding a target node, depends on the sort criterion used to sort the load 
vector. Since load information is propagated in hops over the tree, the farther away a node is, the less 
accurate is its information. Therefore, in this embodiment of the evaluation the load vector is sorted 
according to load and distance. Entries are first sorted according to their load. Entries with equal load are 
then sorted in ascending order of distance. . - 

10 A node which is lightly loaded for a long time may appear as the first entry In the load vector of many 
nodes. This may cause many nodes to migrate processes to that node, thus flooding it and creating a 
bottleneck. Two approaches can be used to avoid this problem. 

An equivalence set of nodes can be defined in the load vector as those nodes with the same load and 
distance. For example all nodes with load - 1 and distance 2 are In the the same equivalence set. In this 

75 case, the sort criterion ensures that all the nodes in the same equivalence set are grouped together in the 

vector. Before a load vector is distributed (by update up or update down message) the first equivalence 

set in the vector undergoes permutation. This reduces the probability that the same node wills appear in 
many load vectors. 

In addition, whenever a lightly - loaded node appears first in its own load vector for the first time, a 
20 counter is attached to the entry and initialised to be equal to the load of the lightly loaded node which is 
always less than zero. Whenever the vector is distributed, if that entry is still first, then the counter Is 
incremented. If the counter becomes positive that entry is removed from the vector. Thus the number of 
nodes which are aware of the lightly loaded node is limited and the probability of flooding reduced. 

Initiation of process migration depends on the frequency the local load is checked and on the value of 
25 the local load. The decision when to check a node's load status depends on the chosen policy. There are a 
number of possible policies, including the following 

1 . Periodic the local load is tested after sending update_up message, ie, the local load sampling 

is synchronised with the update_up message. 

2. Event driven The local load is tested whenever a local process is added to the run queue. 

30 The decision when to declare a computer overloaded, ie where to place the overload mark, influences 
the performance of the method. Placing the overload mark too low may cause too many migrations without 
improving response time. Placing it too high may degrade response time. The overload mark value needs to 
be optimised for any particular system and the value may be different for computers in the network having 
different load capacities or for different job type profiles. The optimum value or values can be established in 

35 any given situation by the use, for example, of appropriate computer simulation experiments or by a 
process of trial and error. 

A controlled clustering of nodes can be achieved by an appropriate assignment of rank in the 
configuration files of each node. 

For example, if there are eleven computers a to k and it is required to form clusters of four and seven 
40 computers as follows, 

abed e f )i h i j 

1st cluster 2nd clii.<;t;or 

45 

the following steps are performed to rank the nodes. 
1 . Each cluster is ranked separately. For example 

50 

abed efgb ijk 

1203 560123 A 

55 2. Each cluster is taken In turn, and rank of each node is increased by the number of nodes in previous 
clusters. For example if the 2nd cluster is chosen first. 
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abed fi f ' J 

8 9 7 10 5 r> 0 1 7. 3 A 

5 

Cluster configuration is achieved by assigning different ranking files to the computers. Each computer 
except the one with lowest rank in a cluster, stores a ranking list that contains ranks of nodes in Its cluster 
and not those outside the cluster. The computer with the lowest rank stores, the ranking list with* nodes in 
other, clusters. . - - 
10 Nodes in a cluster will therefore connect to each other since they are not aware of other nodes. Only 
the node with the lowest tank will connect to a node in another cluster, because it cannot connect to another 
node in the cluster. Figure 4a shows a sample tree formed from the nodes in the above example. 

If the maximum number of children for c is three, and the favoured parent lists of a. b and d contain 
only c, the tree described in Figure 4b may result. 
75 Placing j in the favoured parent list of c may generate the tree described in Figure 4c. 

Thus clustering is achieved in the embodiment of the invention by judicious choice of ranking files and 
favoured parents lists for each node. 

Whilst the invention has been described in terms of an embodiment using a sender -initiated load 
balancing policy in which an overloaded node initiates the transfer of tasks, it will be clear to those skilled in 
20 the art that the invention could equally well be applied to a receiver - initiated scheme in which a lightly 
loaded computer initiates transfer of tasks from a heavily loaded computer to itself. 

There has been described a method of operating a computer in a network and operating a network of 
computers using an improved load balancing technique which is scalable, fault tolerant, flexible and 
supports clustering thus making it suitable for use in networks having very large numbers of computers. 

25 

Claims 

1. A method of operating a computer in a network of computers, the method comprising: 

generating logical links between the computer and other computers in the network so that a tree 
30 structure is formed, the computer being logically linked to one computer higher up the tree and a 
number of computers lower down the tree; and 

maintaining in the computer stored information regarding the current load on the computer and the load 
on at least some of the other computers in the network by causing the computer periodically to 
distribute the information to the computers to which it is logically linked, to receive from said computers 
35 similar such information and to update its own information in accordance therewith, so that the 
information can be used to determine computers in the network that can accept extra load, 

2. A method as claimed in claim 1 comprising the step of, when the computer is overloaded, using the 
information to determine a computer that can accept extra load and transferring at least one task to that 

40 computer. 

3. A method as claimed in claim 1 or claim 2 comprising, if the computer higher up the tree, to which the 
computer is logically linked fails or is otherwise inoperative, generating a new logical link to another 
computer in the network which has capacity for accepting new downward links. 

45 

4. A method as claimed in any preceding claim wherein the periodic distribution of load information is 
used by the computer to determine whether or not the computers to which it is linked in the tree, are 
operational, the computer being marked, if one of the computers lower in the tree to which it is linked is 
not operational, as having capacity to accept new downward links. 

50 

5. A method as claimed in any preceding claim wherein the information stored in the computer contains a 
number of entries, each entry containing information regarding 

(i) the load on one of the computers in the network, 

(ii) the number of links in the tree separating that other computer from the computer, and 

55 (iii) the one of the computers, which are linked to the computer, from which the entry was last 

received: 

and wherein when the computer receives the similar information from a computer to which it is linked 
the following steps are performed: 
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(a) the number of links value in each entry of the received information is incremented by one; 

(b) entries in the received information which originated from the computer are deleted; 

(c) entries in the information already stored in the computer which were received from the sending 
computer are deleted; 

5 (d) the received information is merged with the information already stored in the computer; 

(e) the merged information is sorted in ascending order of load, entries with equal load being sorted 
in ascending order of number of links separation from the computer. 

6. A method as claimed in claim 5 wherein before each distribution of information entries in the 
10 information relating to computers which have the same load _and number of links separation from the 

computer are randomly permuted, ^ 

7. A method as claimed in claim 5 or claim 6 wherein if an entry corresponding to the computer appears 
first in its own sorted information a counter is attached to the entry and initialised to negative value of 

75 the spare load capacity of the computer, the counter being Incremented, if that entry is stiH first, 
whenever the information is distributed, and if the counter becomes positive that entry is removed from 
the information, whereby the probability that the same entry will appear first in the information stored in 
different computers is reduced, 

20 8. A method as claimed in any preceding claim wherein the period of the periodic distribution of the 
information to a computer higher up the tree depends on the rate at which new tasks are created in the 
computer 

9. A method as claimed in claim 8 wherein the period of the periodic distribution depends on the rate at 
25 which new tasks are created in the computers to which the computer is logically linked in the tree 

structure. 

10. A method as claimed in any preceding claim wherein the distribution of the information to a computer 
lower in the tree takes place in response to receipt by the computer of the similar information from the 

30 computer lower in the tree. 

11. A method of operating a network of computers, comprising operating each computer using a method as 
claimed in any preceding claim. 

35 12. A method as claimed in claim 11 wherein the generation of the logical links is achieved by assigning a 
rank to each computer, no two computers being assigned the same rank, each computer being linked 
to one computer of tower rank and a number of computers of higher rank to form the tree structure. 

13. A computer capable of operating in a network of similar such computers, the computer comprising: 

40 means identifying computers in the network to which the computer is logically linked in a tree structure; 
means for storing information regarding the load on the computer and at least some of the other 
computers in the network; 

means for sending said information to the computers to which the computer is logically linked; 
means for receiving similar information from said computers to which the computer is logically linked 
45 and updating the stored information in accordance therewith; 

means for selecting one of the other computers in the network having spare load capacity from said 
information and transferring tasks to the selected computer. 

14. A computer as claimed in claim 13 including: 

50 means for accessing a stored list of all the computers in the network; 

means for selecting a computer from this list as a candidate neighbouring computer for linking in the 
tree structure; 

means for sending a message to the selected computer indicating a link request; 
means for establishing a logical link with the selected computer by updating the identifying means if a 
55 positive response from the selected computer is received; 

means from receiving messages from other computers indicating that a link is requested; 

means for determining, on receipt of such a link request message, whether or not there is capacity for 

establishing a downward link and sending a positive response to the sender of the link request 
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message if there is such capacity and updating the identifying nneans accordingly. 
15. A network of computers connprising a plurality of computers as clainned in clainn 13 or claim 14. 

5 

10 

15 
20 
25 
30 
35 
40 
45 
50 



12 



EP 0 540 151 A2 



V PHASE 



SELECT A NODE IN CRI^FPI 
IF NONE FOUND SELECT ONE 
WITH LOWER RANK AT RANDOM 




FOUND \ NO 
.SUCH NODE 



YES 



SEND THE SELECTED NODE A 
*REQUEST_PARENT' MESSAGE 



/^-^CEIVE^N 
V^>HASE^ 



FIG.Ia 



EP 0 540 151 A2 




14 



EP0 540 151 A2 



DISENGAGE 



REPLACE k WITH m 






SEND ACK TO m | 






SEND TO k DISENGAGE 
MESSAGE WITH 
PARAMETER RANK (m) 


\ 





V ^PHASE ^ FIG.1C 



C^ACK^ 



1 




LOCATE NODE n WITH 
LOWEST RANK SATISFYING 
RANK(n)> RANKIi ) 






SEND TO m NOACK 
MESSAGE WITH PARAMETER 
RANK (n) 







RECEIVE ^ 

PHASF^ FIG. 1d 



15 



EP0 540 151 A2 




FIG.2C 



16 



EP 0 540 151 A2 





17 



EP 0 540 151 A2 




FIG.Aa 



(d,10) 




FIG.^b 




® 



® 



J 



Europalsches Patentamt 
European Patent Office 
Office europeen des brevets 



© Publication number: 
EUROPEAN PATENT APPLICATION 



0 540 151 A3 



0 Application number:. 923081 37.6 - 


0 int. ci.5:G06F 9/46 ~ ^ 


\^ Date of filing: 08.09.92 




® Priority: 31.10.91 IL 99923 


49/8 Meir Nakar Street 


@ Date of publication of application: 


Jerusalem(iL) 


Inventor: Bach, Moshe 


05.05.93 Bulletin 93/18 


Trun^peldor Street 5a 




Haifa(IL) 


@ Designated Contracting States: 


Inventor: MoattI, Yosef 


DE FR GB 


68/56 Hanita Street 




Haifa(IL) 


® Date of deferred publication of the search report: 


Inventor: Teperman, Abraham 


13.10.93 Bulletin 93/41 


46 Ha viva Reich Street 




Haifa(IL) 


0 Applicant: International Business Machines 




Corporation 


0 Representative: Burt, Roger James, Dr. 


Old Orchard Road 


IBM United Kingdom Limited 


Armonk, N.Y. 10504(US) 


Intellectual Property Department 


0 Inventor: Allon, David 


Hursley Park 


Winchester Hampshire S021 2JN (GB) 



0 Method of operating a computer In a network. 



@ A method is described of operating a computer 
in a network of computers using an improved load 
balancing technique, the method comprising: gen- 
erating logical links between the computer and other 
computers in the network so that a tree structure is 
formed, the computer being logically linked to one 
computer higher up the tree and a number of com- 
puters lower down the tree; maintaining in the com- 
puter stored information regarding the current load 
on the computer and the load on at least some of 
^ the other computers in the network by causing the 
^ computer periodically to distribute the information to 
the computers to which it is logically linked, to 
jJJ receive from said computers similar such information 
^ and to update its own information in accordance 
therewith, so that the information can be used to 
^ determine a computer in the network that can accept 
If) extra load. A sender-initiated embodiment of the 
Q invention includes the further step of, when the com- 
puter is overloaded, using the information to deter- 
& mine a computer that can accept extra load and 
^ transferring at least one task to that computer. The 
load balancing technique is scalable, fault tolerant, 
flexible and supports clustering thus making it suit- 



able for use in networks having very large numbers 
of computers. 




Rank Xerox (UK) Business Services 

(3.10/3.6/3.3.1) 



European Patent 
Office 



EUROPEAN SEARCH REPORT 



Application Number 

EP 92 30 8137 



DOCUMENTS CONSIDERED TO BE RELEVANT 



Catetory 



- Chadoa of documcat wiUi indicatioii, where appropriate, 
of retevapt patsagcs 



Rdevaat 
to daim 



CLASSmCATION OF THE 
APPUCATION a". CLS ) 



IEEE TRANSACTIONS ON AEROSPACE AND 
ELECTRONIC SYSTEMS 

vol. 26, no. 3, May 1990, NEW YORK US 
pages 511 - 516 

Y.C.CHENG ET AL. 'Distributed Computation 
for a Tree Network with Communication 
Delays' 

* figure 1 * 

* page 511, right column, line 1 - line 27 



SYSTEMS & COMPUTERS IN JAPAN 

vol. 19, no. 6, June 1988, SILVER SPRING, 

MD, US 

pages 35 - 48 

M.TAKESUE 'A Distributed Load-Balancing 
System and its Application to 
List-Processing Oriented Data-Flow Machine 
OFM' 

* page 35, left column, line 1 - page 36, 
left column, line 54 * 

* figure 1 * 

IEEE TRANSACTIONS ON COMPUTERS 

vol. 41, no. 3, March 1992, NEW YORK US 

pages 257 - 273 

N.S.BOWEN ET AL. 'On the Assignment 
Problem of Arbitrary Process Systems to 
Heterogenous Distributed Computer Systems' 

* abstract * 

* page 265, right column, line 1 - page 
267, right column, line 10 * 

* figures 1,9,12 * 



1,13 



G06F9/46 



1.13 



TECHNICAL FIELDS 
SEARCHED (iBt. CLS ) 



1,13 



G06F 



Tbc prevent soucfa rcf ort has beeo drawn up for all claims 



nacatriorcfc 

THE HAGUE 



19 AUGUST 1993 



SCHARFENBERGER B. 



CATEGORY OF QTED DOCUMENTS 

X ; partf oulariy rdcvaal If (akco &Jm« 

Y : pvtioilarfy rricvmat If coablacd with aaotb«r 

documcat of tfa« umi category 
A : tochnologiaJ back(rouD4 
O : aoo-iniitai dtsdocun 
P : intenncdiatt documcnl 



T ; Aearj or priodpla uodcriylng the Invnttoo 
E : eariior ^tcnt doaimeot, but pabttsbvl oo, or 

tftcr th* niisg 4tt« 
D : locuiacBt dt«4 la tha applkadoo 
L : rfociuaAt dted for otb«r reasons 

A : ta«Bbcr of th« sum patooi faialty, correspoaling 



(19) 



J 



Europaisches Patentamt 
European Patent Office 
Office europeen des brevets 



(12) 



(45) Date of publication and mention 
of the grant of the patent: 
25.11.1998 Bulletin 1998/48 

(21) Application number; 92308137.6 

-(22) Date of filing: 08.09.1992 



(11) EP 0 540 151 B1 

EUROPEAN PATENT SPECIFICATION 

(51) lntCI.6: G06F9/46 



(54) Method of operating a computer in a network 

Verfahren zum Betrieb sines Rechners in einem Netz 
Methode d'operatlon d'un ordinateur dans un reseau 



(84) Designated Contracting States: 
DE FRGB 

(30) Priority: 31.10.1991 IL 99923 

(43) Date of publication of application: 
05.05.1993 Bulletin 1993/18 

(73) Proprietor International Business Machines 
Corporation 

Armenia, N.Y. 10504 (US) 

(72) Inventors: 

• Alien, David 
Jerusalem (IL) 

• Bach, Moshe 
Haifa (IL) 

• MoattI, Yoseff 
Haifa (IL) 

• Teperman, Abraham 
Haifa (IL) 



QQ 

o 
o 

Q. 
LU 



(74) Representative: Burt, Roger James, Dr. 
IBM United Kingdom Limited 
Intellectual Property Department 
Hursley Park 

Winchester Hampshire S021 2JN (GB) 

(56) References cited: 

• IEEE TRANSACTIONS ON AEROSPACE AND 
ELECTRONIC SYSTEMS vol. 26, no. 3, May 1990, 
NEW YORK US pages 511 - 516 Y.C.CHENG ET 
AL. 'Distributed Computation for a Tree Network 
with Communication Delays' 

• SYSTEMS & COMPUTERS IN JAPAN vol. 1 9, no. 
6, June 1988, SILVER SPRING, MD, US pages 35 
- 48 M.TAKESUE 'A Distributed Load-Balancing 
System and Its Application to List-Processing 
Oriented Data-Fiow Machine DFM' 

• IEEE TRANSACTIONS ON COMPUTERS vol. 41, 
no. 3, March 1992, NEW YORK US pages 257 - 
273 N.S.BOWEN ET AL. 'On the Assignment 
Problem of Arbitrary Process Systems to 
Heterogenous Distributed Computer Systems* 



Note: Within nine months from the publication ot the mention of the grant of the European patent, any person may give 
notice to the European Patent Office of opposition to the European patent granted. Notice ot opposition shall be filed in 
a written reasoned statement. It shall not be deemed to have been filed until the opposition fee has been paid. (Art. 
99(1) European Patent Convention). 



Printed by Joiive, 75001 PARIS (FR) 



EP 0 540 151 B1 



Description 

The Invention relates to nnethods of operating connputers in networks and to computers capable of operating in 
networks. 

Local area networks of connputers are an indispensable resource in many organisations. One problem in a large 
network of computers is that it is sometimes difficult to make efficient use of all the computers in the network. Load 
sharing or balancing techniques increase system throughput by attempting to keep ail computers busy. This is done 
by off-loading processes from overloaded computers to idle ones thereby equalising load on all machines and mini- 
mising overall response time. 

- Load balancing methods can be classified according to the method used to achieve balancing. They can be 'static' 
or 'dynamic' and central' or 'distributed*. 

In static load balancing, a fixed policy - deterministic or probabilistic - is followed independent of the current system 
state. Static load balancing is simple to implement and easy to analyse with queuing models. However, its potential 
benefit is limited since it does not take Into account changes in the global system state. For example, a computer may 
migrate tasks to a computer which is already overloaded. 

in dynamic schemes the system notes changes in Its global status and decides whether to migrate tasks based 
on the current state of the computers. Dynamic policies are inherently more complicated than static policies since they 
require each computer to have knowledge of the state of other computers in the system. This information must be 
continuously updated. 

In a central scheme one computer contains the global state information and makes all the decisions. In a distributed 
scheme no one computer contains global information. Each computer makes migration decisions based on the state 
information it has. 

Load-balancing methods can be further classified as 'sender-initiated' and 'receiver-initiated'. In sender-initiated 
policies an overloaded computer seeks a lightly loaded computer. In receiver-initiated policies lightly loaded computers 
advertise their capability to receive tasks. It has been shown that if costs of task-transfer are comparable under both 
policies then sender-initiated strategies outperform receiver-initiated strategies for light to moderate system loads^ 
while receiver-initiated schemes are preferable at high system loads. 

Conventionally, a dynamic load balancing mechanism is composed of the following three phases: 

1 . Measuring the load of the local machine. 

2. Exchanging local load infonmation with other machines in the network. 
3- Transferring a process to a selected machine. 

Local load can be computed as a function of the number of jobs on the run queue, memory usage, paging rate, 
file use and I/O rate, or other resource usage. The length of the run queue is a generally accepted load metric. 

The receipt of load information from other nodes gives a snapshot of the system load. Having this information, 
each computer executes a transfer policy to decide whether to run a task locally or to transfer it to another node. If the 
decision is to transfer a task then a location policy is executed to determine the node the task should be transferred to. 

If a load balancing method is to be effective in networks having large numbers of computers it must be scalable, 
in the sense that network traffic increases linearly with the size of the networK flexible so that computers can be easily 
added to or removed from the network, robust in the event of failure of one or more of the computers and must support 
to some degree clustering of the computers. 

Centralised methods, though attractive due to their simplicity are not fault -tolerant. The central computer can detect 
dead nodes and it takes one or two messages to re-establish connection. However, if the central node fails, the whole 
scheme falls apart. Load balancing configuration can be re-established by maintaining a prioritised list of alternative 
administrators in each node or by implementing an election method to elect a new centralised node. When a central 
node fails, other nodes switch to the new central node. This enhancement, however, increases the management com- 
plexity and cost of the method. 

In addition, management overhead for a centralised method is unacceptably large. As the number of nodes in- 
creases, the time spent by the central node in handling load information increases, and there must come a point at 
which it will overload. 

Several prior art load balancing methods are both dynamic and distributed. 

For example, in the method described in Barak A. and Shiloh A. SOFTWARE - PRACTICE AND EXPERIENCE, 
15(9): 901 -91 3, September 1985 [R1] each computer maintains a fixed size load vector that is periodically distributed 
between computers. Each computer executes the following method. First, the computer updates its own load value. 
Then, it chooses a computer at random and sends the first half of its load vector to the chosen computer. When receiving 
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a load vector, each computer merges the information with Its local load vector according to a predefined rule. 

A problem with this method is that the size of the load vector has to be chosen carefully. Finding the optimal value 
for the load vector is difficult and it must be tuned to a given system. A large vector contains load information for many 
nodes. Thus, many nodes will know of a lightly loaded node, Increasing the chance that it will quickly receive many 
migrating processes and will overload. On the other hand, the size of the load vector should not be so small that load 
values do not propagate through the network in a reasonable time. For large number n of nodes in the network, the 
expected time to propagate load information through the network is 0(log n). 

The method described in R1 can handle any number of computers after tuning the length of the load vector. There- 
fore, the method is scalable. However, it is flexible only up to a point. Adding a small number of nodes to the network 
does not require any change. Adding a large number of nodes requires changing the size of the load vectorjn all 
" computers, thus increasing the administrative overhead. The method Is fault-tolerant and continues to work in spite of 
single failures, however there is no built-in mechanism tor detecting dead nodes and updating information on them in 
the load vectors. Thus information about failed nodes is not propagated and other nodes may continue to migrate 
processes to them, with the result of decreased response time. 

The number of communications per unit time is 0{n). However, the method does not support clustering. When a 
node wants to off-load processes to another node it chooses a candidate from Its load vector. Since the information 
on nodes in the load vector of node p is updated from random nodes, the set of candidates for off-loading is a random 
subset of n and not a controlled set for p. 

In Lin F. C. H. and Keller R. M. PROCEEDINGS OF THE 6TH INTERNATIONAL CONFERENCE ON DISTRIB- 
UTED COMPUTING SYSTEf^S, 329-336. May 1 986 [R2], a distributed and synchronous load balancing method using 
a gradient model is disclosed. Computers are logically arranged in a grid and the load on the computers is represented 
as a surface. Each computer interacts solely with its immediate neighbours on the grid. A computer with high load is 
a 'hiir and an idle computer is a 'valley'. A neutral computer is one which is neither overloaded nor idle. Load balancing 
is a form of relaxation of the surface. Tasks migrate from hills towards valleys and the surface is flattened. Each computer 
computes its distance to the nearest idle computer and uses that distance for task migration. 

An overloaded node transfers a task to its immediate neighbour in the direction of an idle node. A task continues 
to move until it reaches an idle node. If a former idle node is overloaded when a task arrives, the task moves to another 
idle node. 

This method, though scalable, is only partially flexible. It is easy to add or remove nodes at the edge of the grid, 
but it is difficult to do so in the middle since the grid is fixed. To do this, reconfiguration is required, which increases 
the overhead inherent in the method. Detection of dead nodes is not easy. Since only changes of the state of a node 
are sent to its neighbours, a node's failure remains undetected until a job is transferred to it. Late detection delays 
migrating processes and, therefore, increases overall response time. In this method clustering is not supported. Each 
node transfers processes to one of its immediate neighbours which may transfer it in a different direction. The over- 
loaded node has no control as to where the off-loaded processes will eventually execute. Management overhead to 
start the grid is low and the number of messages is 0(n). Administrative overhead for reconfiguration is high since it 
requires changes in all neighbouring nodes for a failed node. In this method node overhead is high and network traffic 
increases because tasks move in hops rather than directly to an idle node and it takes long time to migrate a process. 

The buddy set algorithm proposed in Shin K. G. and Chang Y. C. IEEE TRANSACTIONS ON COMPUTERS, 38 
(8), August 1989 [R3], aims for very fast load sharing. For each node two lists are defined: Its buddy set, the set of its 
neighbours, and its preferred list, an ordered list of nodes in its buddy set to which tasks are transferred. Each node 
can be in one of the following states: under-loaded, medium-loaded and overloaded. Each node contains the status 
of all nodes in its buddy set. Whenever the status of a node changes, it broadcasts Its new state to all nodes in its 
buddy set. The internal order of preferred lists varies per node in order to minimise the probability that two nodes in a 
buddy set will transfer a task to the same node, when a node is overloaded it scans its preferred list for an under- 
loaded node and transfers a task to that node. Overloaded nodes drop out of preferred lists. Buddy sets and their 
preferred lists overlap, so processes migrate around the network. 

Again, this method, though scalable in that the number of messages increases linearly with the number of com- 
puters in the network n, is not flexible. Adding or removing nodes from the network requires recomputation of the 
preferred lists in all nodes of the related buddy set. If a node is added to more than one buddy set the recomputation 
must be done for each node in each buddy set. Detection of dead nodes is difficult for the same reason as in the 
gradient model described in R2. Clustering is supported but reconfiguration is expensive since it requires recomputation 
of new buddy sets and preferred lists, as for adding and removing nodes. The administrative overhead inherent in the 
method is therefore high. 

Cheng Y-C. and Robertazzi T. G. in IEEE TRANSACTIONS ON AEROSPACE AND ELECTRONIC SYSTElVlS, 
vol. 26, no. 3, 511-516, May 1990 [R4], disclose a method of operating a computer in a network of computers, the 
method comprising the step of generating logical links between the computer and other computers in the network so 
that a tree structure is formed. The logical links are such that the computer is logically linked to one computer higher 
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up the tree and a number of computers lower down the tree. 

According to a first aspect of the present invention there Is provided a method of operating a computer in a network 
of computers characterised by: 

maintaining in the computer stored information regarding the current load on the computer and the load on at least 
5 some of the other computers in the network by causing the computer periodically to distribute the information to the 
computers to which it Is logically linked, to receive from said computers similar such information and to update its own 
information in accordance therewith, so that the information can be used to determine computers in the network that 
can accept extra load; wherein the information stored in the computer contains a number of entries, each entry con- 
taining information regarding: 
JO - . - „ _ ^ _ - - - ^ - 

(i) the load on one of the computers in the network, 

(ii) the number of links in the tree separating that other computer from the computer, and 

IS (iii) the one of the computers, which are linked to the computer, from which the entry was last received; 

and wherein when the computer receives the similar information from a computer to which it is linked the following 
steps are performed: 

20 (a) the number of links value In each entry of the received information is incremented by one; 

(b) entries in the received information which originated from the computer are deleted; 

(c) entries in the information already stored in the computer which were received from the other computer are 
25 deleted; 

(d) the received information is merged with the information already stored in the computer; 

(e) the merged information is sorted in ascending order of load, entries with equal load being sorted in ascending 
30 order of number of links separation from the computer. 

The invention is applicable to both sender-initiated and receiver-initiated load balancing. An embodiment of the 
invention using a sender-initiated load balancing technique includes the further step of, when the computer is over- 
loaded, using the information to determine a computer that can accept extra load and transferring at least one task to 
35 that computer 

There is further provided a method of operating a network of computers by operating each computer using the 
above method. 

The generation of the logical links can be achieved by assigning a rank to each computer, no two computers being 
assigned the same rank, each computer being logically linked to one computer of lower rankand a number of computers 
40 of higher rank to form the tree structure. 

The tree structure can be maintained if a computer fails, or is otherwise inoperative, by generating new logical 
links between each of the computers lower down the tree to which the failed computer was linked and other computers, 
which have capacity for accepting new downward links. 

An improved load balancing technique is employed which has the property of scalability by limiting the communi- 
^5 cation load on each computer and the knowledge of the network topology which Is required to be stored in each com- 
puter. The actions taken by each computer are thus independent of the total number of computers in the network. 

Since the tree structure is dynamically built and maintained, the method is also flexible. Changing a tree configu- 
ration requires simply changing the ranking of the reconfigured computers. The new ranking files have to be consistent 
with existing ranking files to avoid cycles. A computer is reconfigured by logically disconnecting it so that it appears 
50 inoperative to its neighbours, replacing its configuration file, and letting it reconnect to the tree. A new configuration is 
achieved without disturbing other nodes on the tree and without disrupting local services. 

Detection and reconnection of failed nodes is also provided for. If a computer higher up the tree does not respond 
within a certain time each higher ranked computer previously connected to it tries to relink Itself elsewhere in the tree. 
If it temporarily fails to do this, it acts as root of a disconnected sub-tree, and load balancing occurs within the sub- 
55 tree. Eventually, the flags marking nodes as already tried or dead will be reset, and the computer, which periodically 
tries to attach itself to the tree, will reconnect to the tree. 

Advantageously, the periodic distribution of load information across the links of the tree is used by each computer 
to determine whether or not the computers to which it is linked are operational, each computer seeking, if the computer 
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to which it is linked higher in the tree is not oprational, to generate a new link with a computer having lower rank and 
having the capacity for new downward links and being nnarked, if one of the connputers lower in the tree to which it is 
linked is not operational, as having capacity to accept new downward links. 

Before each distribution of infornnation between computers, entries in the information relating to computers which 
s have the same load and number of links separation from the computer can be randomly permuted, so that the probability 
that the same entry will appear first in the information stored in different computers is reduced. Also, if an entry corre- 
sponding to the computer appears first in its own sorted infonnation a counter can be attached to the entry and initialised 
to the negative value of the spare load capacity of the computer, the counter being Incremented, if that entry is still 
first, whenever the information is distributed, and if the counter becomes positive that entry is removed from the infor- 
10 mation. The probability that the same entry will appear first in the information stored in different computers is thereby 
reduced. _ . - 

In an advantageous form of the invention the period of the periodic distribution of the information to a computer 
higher up the tree is related to the rate at which new tasks are created in the computer and/or the rate at which new 
tasks are created in the computers to which it is logically linked in the tree structure. This has the advantage that if one 
IS or more nodes suddenly become highly loaded this information will spread more quickly throughout the network. 

A further advantage of the method provided in the present invention is that the case where the overall load in the 
network increases steadily is handled gracefully. In such a case, when local load vectors are searched for a target 
computer, fewer candidates will be found and the number of migrations will decrease. When overall load decreases, 
more candidates will be found in the local load vectors and migrations will resume. In other words the network per- 
20 formance degrades gracefully during network load saturation and does not crash, rather normal operation is resulted 
when overall load returns to normal. 

Viewed from another aspect the invention enables to be provided a computer as claimed in claim 12 capable of 
operating in a network of similar such computers. 

The computer preferably also includes means for accessing a stored list of all the computers in the network; means 
2S for selecting a computer from this list as a candidate neighbouring computer for linking in the tree structure; means tor 
sending a message to the selected computer indicating a link request; means for establishing a logical link with the 
selected computer by updating the identifying means if a positive response from the selected computer is received; 
means for receiving messages from other computers indicating that a link is requested; means for determining, on 
receipt of such a link request message, whether or not there is capacity for establishing a downward link and sending 
30 a positive response to the sender of the link request message if there is such capacity and updating the identifying 
means accordingly. 

There is also provided a network of such computers. 

An embodiment ot the invention will now be described, by way of example only, with reference to the accompanying 
drawing, wherein: 

35 

Figures la. lb, 1c and Id are flow diagrams showing the request and receive phases of the tree generation; 

Figures 2a, b and c show stages of the tree generation according to the present invention; 

40 Figures 3a, b and c illustrate an example of tree maintenance in the embodiment of the invention; 

Figures 4a, b and c illustrate the handling of clustering of nodes in the embodiment of the invention. 

This embodiment of the invention is composed of two main parts. The first part is tree building and maintenance, 
^5 and the second part is the exchange and maintenance of load balancing information. 



TREE BUILDING AND MAINTENANCE 



Each computer in the network is assigned a unique rank which is used in building the tree. In the following the 
so term "parent" will be used to refer to a computer of lower rank to which a computer can form an upward link and the 
term "child" will be used to refer to a computer of higher rank to which a computer can form a downward link. 
Each computer has stored in it a configuration file that contains 
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(i) the maximum number of direct descendants the computer can accept. These are referred to as children slots. 

(ii) a list of computers and their respective ranks. 

(iii) an ordered list of "favoured parent" nodes, FP Favoured parents have lower rank than the current computer 
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The FP list is used by the computer to choose a parent node. The FP list is optional. It can be empty in which case 
a parent is chosen at random. The FP list includes a subset of all computers of lower rank. 

(iv) the time to wait for parent and children responses. 

(v) the time to wart to probe if a node is dead or alive. 

A list of candidate parents for a computer CR, which will be referred to as the candidates range of P, limits the 
range of possible parents. Initially CR of each computer contains all the computers with rank less than the rank of the 
computer A node is chosen as a parent only if it is in CR and in FP. If FP is empty or exhausted then only membership 
of CR is checked. - 

Each computer performs the following steps to build the tree. There is a request phase in which a computer chooses 
another computer and requests it to become a parent, and a receive phase in which the computer waits for a reply and 
responds to requests from other computers. 

Request Phase 

1. Each computer i picks another computer from its list of favoured parents FPj. If FPj is empty or all candidates 
in FPj failed, a computer with lesser rank is chosen at random. If the chosen computer j is not in CRj then the next 
one in FPj is chosen or another is randomly picked. If all nodes were tried unsuccessfully then the computer enters 
the receive phase. 

2. i sends the chosen computer j a request_parent message and waits for messages from other computers. 

3. If the candidate parent does not respond within a certain period of time, it is marked dead and i reenters the 
request phase. 

Receive Phase 

1. If a request_parent message arrives from m then: 

(a) If i has a free child slot, it accepts m's request and sends back an ack message 

(b) If i has exhausted its allotted number of child slots then: 

1) i scans its children tor a late child as defined below. If one is found then 

(a) the child is marked as dead and removed 

(b) this makes a child slot available 

(c) i sends m an ack message 

If a late child is not found then 

2) i finds the child k with highest rank. 

3) If rank(k) > rank(m) then: 

a) i accepts m as a child in place of k and sends m an ack message. 

b) i sends k a disengage message containing rank{m) as a parameter. 

4) If rank(k) < rank(m) then: 

a) i locates in its list of computers the computer n with the lowest rank satisfying rank(n) > rank(i). 

b) i sends m a noack message containing rank(n) as a parameter 

2. If a disengage message with parameter r arrives then: 

a) i clears its parent field 
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b) i removes the computer with ranks outside the range [rank{i)-1,r] from CRi and starts looking tor a new 
parent, ie enters request phase. 

3. If ack arrives from candidate parent the computer records the candidate parent as a true parent. 

4. If a noack message with parameter r arrives then: 

(a) i clears its candidate parent field 

(b) the computer executes step 2(b) above. 



Figure la is a flow diagram showing the request phase of the tree generation and Figures lb, 1c and Id are flow 
diagrams showing the receive phase of the tree generation. 

As an example of tree generation, consider 5 computers with ranks 0 to 4 where each can accept two children. 
Due to random choice of parents the method can reach the stage shown in Figure 2a. 
IS Computer 1 can choose only computer 0 as a parent but computer 0 has already two children. Therefore, the 

method replaces child 3 of computer 0 with computer 1 and sends computer 3 a disengage message with parameter 
1. Thus the situation pictured in Figure 2b is reached. 

Node 3 searches for a parent in the range 2 to 1 and picks, say 1 as parent. The resulting tree appears in Figure 2c. 

If all nodes are alive and there are no late children then the tree generation terminates with one tree. 
^0 The above tree generation may build unbalanced trees or trees with many nodes having only one child. However, 

it has been found that the load balancing method is insensitive to the topology of the tree. 

Tree maintenance is required to detect dead nodes, to add new nodes, or to reconnect rebooted nodes. It is 
necessary because nodes fait and nodes are sometimes shut down for maintenance. In this embodiment of the invention 
such cases are handled in the following way. 

25 

1 . Each computer periodically sends an update_up message to its parent and waits for an update_down message. 

2. If no response arrives within a specified period of time the parent is marked dead and the computer looks for a 
new parent. 

30 

3. If an update_up message arrives from a child the node responds with an update_down message. 

4. If no update_up message arrives from a child within a specified period of time the child ts marked late. 

3S 5. If an updale_up message arrives from a late child then the late flag for that child is reset and the node responds 

with an update_down message. 

6. When a computer is rebooted or added to the network it looks for a parent, ie it enters the request phase. 

7. Each computer flags computers already tried and those marked dead. Periodically these flags are cleared, and 
CR is reset to contain all computers with lesser rank. Thus rebooted nodes can be probed again. 

As an example, consider a tree of 1 8 nodes labelled a to r as shown in Figure 3a. 

If node h fails, then after a predefined period of time, its parent c will mark it as late and ignore it since it did not 
-^5 receive an update_up message. Nodes k and r will start looking for new parents since no update_down message 
arrived. The network now comprises a number of separate trees as shown in Figure 3b. 

k and r look for a new parent in their FR If their FP is empty they choose a parent at random. The new tree may 
look like the one in Figure 3c, with node k having node g as its new parent and node r having node c as its new parent. 
The tree generation together with the tree maintenance mechanism ensure that the computers in the network 
so generally will be arranged in a single tree structure. To see that this is true in the case of a node failing consider an 
arbitrary tree in which node k fails or is otherwise inoperational. k's parent will mark it as late and k's children will mark 
it as dead. Assume m was a child of k. Therefore, m looks for a new parent. There are three possibilities: 
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1. m finds a node which can accept it as a child. Thus, a single tree is formed. 

2. m picks k's parent as a candidate and there are no free slots. In this case the slot of k marked as late is used 
and k is marked as dead. Again, a single tree is formed. 
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3. m does not find a node which can accept it as cinild. 

This can happen only if the nodes which would have accepted it (there is at least one, ie m-i) are marked as already 
tried or as dead. In this case the network temporarily comprises a number of sub-trees rather than a single tree. 
This state will end when all these flags are reset (step 7 of the maintenance mechanism) and then m will try again 
and will find a parent. 

The above reasoning is applicable for all the children of k. 
EXCHANGE AND MAINTENANCE OF INFORMATION 

The number of jobs on the run queue is generally accepted as a suitable load metric. A node is considered over- 
loaded if the length of the run queue exceeds a predefined threshold, overload_mark. Since the quantity that is really 
of interest Is how far the node is from its overload mark, a slightly different metric is used in this embodiment of the 
invention. In this embodiment of the invention load is defined as length_of_run-queue - overload_mark. For example, 
if overload_markA = 10 and overload.markg = 8 and the length of run queue of nodes A and B are 5 and 4 respectively, 
then the load;^ = -5 and loadg = -4. Thus, the load of B is higher than that of A. 

In the following this metric will be used for load and whenever the term load is used, length_of_run_queue - 
overload_mark is meant. 

Each computer in the network stores a sorted vector which holds information on other computers in the network. 
Each entry in the vector contains: 

1 ) the rank of the computer 

2) the local load of the computer 

3) the distance in terms of the number of edges or links the load information has traversed 

4) the last node that propagated this information (last node field) 

The length of the load vector, ie the number of entries it contains, may vary from node to node. 
Load information is distributed over the network in the following way. 

1. Periodically each computer samples its load L, sorts its load vector using the new value of L, and sends the 
load vector to its parent in an update_up message. The update_up message is also used as the indication that 
the node is operational for the purposes of tree maintenance. 

2. When an update_up message is received by node m from node k, 

a The distance of each entry in the received load vector is incremented by 1. 

b All entries in the received load vector which originated from receiving node m are deleted. 

c All entries in the load vector which originated from the sending node k are deleted. 

d The distances of entries (computers) which appear in both the local vector and the received vector are 
compared: 

1 . If the distance of local entry is greater or equal to the distance of the received entry then the local entry 
is deleted. 

2. If the distance of the received entry is greater than the distance of the local entry then the received 
entry is deleted. 

e The last node field in the entries of the received vector is set to be the sending node k. 
f The received load vector is merged with the local load vector 

3. A parent sends its new load vector to the child who sent the update_up message. This message is an 
update_down message which again is used as an indication that the parent is operational in the tree maintenance 
message. 

4. When an update_down message arrives, the receiving child processes the received load vector in the same 
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way as described in step 2. 

The update_down messages propagate information received by each node from the nodes to which they are linked 
in the tree structure. Thus children share information by passing it up to the parent who passes it down to the other 

s children. By propagating load information up and down the tree, information on under-loaded nodes In another sub- 
tree can reach any node. Due to the fixed size of the load vector, each computer holds information on a sub-set of the 
computers that are least loaded. 

The load information In a load vector is only an approximation of the real load on any particular computer, since 
by the time it reaches another node the load of the originating computer may have changed. 

JO The 'update Interval' is defined as the time interval between.successive load distributions; le update_up messages. 

This interval significantly influences the results of load balancing. Too large a value renders the load information of the 
current node in other nodes inaccurate. Migration decisions based on this information result in many rejections due to 
the inaccuracy of the information. Too small a value; ie a higher update rate, increases the overhead of the method 
and response time is degraded. The update rate (1/updateJnten/al) should be proportional to the rate of generation 

IS of local processes and should change as this rate changes. In addition there is another factor which has to be consid- 
ered. It has to be ensured that the update rate of the current node does not slow down the propagation of information 
from its children to its parent and vice versa. This will happen if the rate of the current node is low and that of its children 
or its parent is high. Therefore, the update rates of the immediate neighbours have also to be considered. 

The effective_rate of a node is defined as the result o1 a rate calculation function based on these two factors. 

20 In this embodiment of the invention the following function is used. Initially the effective rate of each node is set to 

the birth rate of local processes. For each node p the following quantities are computed: 

a) rate^ = locaLbirth_ratep 

Local birth ratep is computed as an average over a sliding time window. 

2S 

b) rate2 = max{effective_raten}/a 

for all immediate neighbours n of p. a is an attenuation factor determined by computer simulation experiments 
as described below. 

30 The effective_rate of p is given by: 

effect ive_ratep = max{rate^ ^rateg} 

35 In computer simulation experiments it was found that values o1 a in the range 1.4 to 2.0 gave good results for 

networks with evenly distributed load and networks with regions of high load. 

When a computer is overloaded it searches its local load vector for the first entry indicating a computer that can 
accept extra load, and transfers one or more tasks directly to it. The load entry of the target computer in the load vector 
is updated and the load vector is re-sorted. If the target computer cannot for whatever reason accept extra load, the 
40 sender is notified. In this case the sender deletes that entry from the load vector, retrieves the next entry and use it as 
a new target. If no node can accept extra load then no migration is initiated. 

Thus, the location policy; ie, finding a target node, depends on the sort criterion used to sort the load vector. Since 
load information is propagated in hops over the tree, the farther away a node is, the less accurate is its information. 
Therefore, in this embodiment of the evaluation the load vector is sorted according to load and distance. Entries are 
45 first sorted according to their load. Entries with equal load are then sorted in ascending order of distance. 

A node which is lightly loaded for a long time may appear as the first entry in the load vector of many nodes. This 
may cause many nodes to migrate processes to that node, thus flooding it and creating a bottleneck. Two approaches 
can be used to avoid this problem. 

An equivalence set of nodes can be defined in the load vector as those nodes with the same load and distance. 
50 For example all nodes with load -1 and distance 2 are in the the same equivalence set. In this case, the sort criterion 
ensures that all the nodes in the same equivalence set are grouped together in the vector Before a load vector is 
distributed (by update_up or update_down message) the first equivalence set in the vector undergoes permutation. 
This reduces the probability that the same node will appear in many load vectors. 

In addition, whenever a lightly-loaded node appears first in its own load vector for the first time, a counter is attached 
55 to the entry and initialised to be equal to the load of the lightly loaded node which is always less than zero. Whenever 
the vector is distributed, if that entry is still first, then the counter is incremented. If the counter becomes positive that 
entry is removed from the vector. Thus the number of nodes which are aware of the lightly loaded node is limited and 
the probability of flooding reduced. 
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Initiation of process migration depends on the frequency the local load is checked and on the value of the local 
load. The decision when to check a node's load status depends on the chosen policy. There are a nunnber of possible 
policies, including the following 

5 1. Periodic the local load is tested after sending update_up message, ie, the local load sampling is synchro- 

nised with the update_up message. 

2. Event driven The local load is tested whenever a local process is added to the run queue. 

10 The decision when to declare a computer overloaded, ie where to place the overload mark, influences the per- 

formance of the method. Placing the overload mark too low may cause too many migrations without improving response 
time. Placing it too high may degrade response time. The overload mark value needs to be optimised tor any particular 
system and the value may be different for computers in the network having different load capacities or for different job 
type profiles. The optimum value or values can be established in any given situation by the use, for example, of ap- 
propriate computer simulation experiments or by a process of trial and error. 

A controlled clustering of nodes can be achieved by an appropriate assignment of rank in the configuration files 
of each node. 

For example, if there are eleven computers a to k and it is required to form clusters of four and seven computers 
as follows, 
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the following steps are performed to rank the nodes. 
1. Each cluster is ranked separately For example 

abed 
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55 2. Each cluster is taken in turn, and rank of each node is increased by the number of nodes in previous clusters. 

For example if the 2nd cluster is chosen first. 



a t> c d r K », i J u 
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Cluster configuration is achieved by assigning different ranking files to the computers. Each computer except the 
one with lowest rank in a cluster, stores a ranking list that contains ranks of nodes in its cluster and not those outside 
the cluster. The computer with the lowest rank stores the ranking list with nodes in other clusters. 

Nodes in a cluster will therefore connect to each other since they are not aware of other nodes. Only the node with 
the lowest rank will connect to a node in another cluster, because it cannot connect to another node in the cluster. 
Figure 4a shows a sample tree formed from the nodes in the above example. 

If the maximum number of children for c is three, and the favoured parent lists of a, b and d contain only c, the 
tree described in Figure 4b may result. 

Placing j in the favoured parent list of c may generate the tree described in Figure 4c. 

Thus clustering is achieved in the embodiment of the invention by judicious choice of ranking files and favoured 
parents lists for each node. 

Whilst the invention has been described in terms of an embodiment using a sender-initiated load balancing policy 
in which an overloaded node initiates the transfer of tasks, it will be clear to those skilled in the art that the invention 
could equally well be applied to a receiver-initiated scheme in which a lightly loaded computer initiates transfer of tasks 
from a heavily loaded computer to itself. 
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There has been described a method o1 operating a connputer in a network and operating a network of computers 
using an improved toad balancing technique which is scalable, fault tolerant, flexible and supports clustering thus 
making it suitable for use in networks having very large numbers of computers. 



Claims 

1. A method of operating a computer in a network of computers, the method comprising: 

generating logical links between the computer and other computers in the network so that a tree structure is 
formed, the computer being logically linked to one cpmputer higher up the tree and a number of computers 
tower down the tree; and characterised by 

maintaining in the computer stored information regarding the current load on the computer and the load on at 
least some of the other computers in the network by causing the computer periodically to distribute the infor- 
mation to the computers to which it is logically linked, to receive from said computers similar such information 
and to update its own information in accordance therewith, so that the information can be used to determine 
computers in the network that can accept extra load; 

wherein the information stored in the computer contains a number of entries, each entry containing information 
regarding: 

(i) the load on one of the computers in the network, 

(ii) the number of links in the tree separating that other computer from the computer, and 

(iii) the one of the computers, which are linked to the computer, from which the entry was last received; 

and wherein when the computer receives the similar information from a computer to which it is linked the 
following steps are performed: 

(a) the number of links value in each entry of the received information is incremented by one; 

(b) entries in the received information which originated from the computer are deleted; 

(c) entries in the information already stored in the computer which were received from the other computer 
are deleted; 

(d) the received information Is merged with the information already stored in the computer; 

(e) the merged information is sorted in ascending order of load, entries with equal load being sorted in 
ascending order of number of links separation from the computer. 

2. A method as claimed in claim 1 comprising the step of, when the computer is overloaded, using the information 
to determine a computer that can accept extra load and transferring at least one task to that computer 

3. A method as claimed in claim 1 or claim 2 comprising, if the computer higher up the tree, to which the computer 
is logically linked fails or is otherwise inoperative, generating a new logical link to another computer in the network 
which has capacity for accepting new downward links. 

4. A method as claimed in any preceding claim wherein the periodic distribution of load information is used by the 
computer to detei-mine whether or not the computers to which it is linked in the tree, are operational, the computer 
being marked, if one of the computers lower in the tree to which it is linked is not operational, as having capacity 
to accept new downward links. 

5. A method as claimed in claim 1 wherein before each distribution of information entries in the information relating 
to computers which have the same load and number of links separation from the computer are randomly permuted. 
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6. A method as claimed in claim 1 or 5 wherein if an entry corresponding to the computer appears first in its own 
sorted information a counter is attached to the entry and initialised to negative value of the spare load capacity of 
the computer, the counter being Incremented, it that entry is still first, whenever the information is distributed, and 
if the counter becomes positive that entry is removed from the information, whereby the probability that the same 

s entry will appear first In the information stored In different computers is reduced. 

7. A method as claimed in any preceding claim wherein the period of the periodic distribution of the information to a 
computer higher up the tree depends on the rate at which new tasks are created in the computer 

10 B. A method as claimed in claim 9 wherein the period of the periodic distribution depends on the rate at which new 
tasks are created in the computers to which the computer is logically linked in the tree structure. 

9. A method as claimed in any preceding claim wherein the distribution of the information to a computer lower in the 
tree lakes place in response to receipt by the computer of the similar information from the computer lower in the tree. 

75 

10. A method of operating a network of computers, comprising operating each computer using a method as claimed 
in any preceding claim. 

11. A method as claimed in claim 10 wherein the generation of the logical links is achieved by assigning a rank to 
20 each computer, no two computers being assigned the same rank, each computer being linked to one computer of 

lower rank and a number of computers of higher rank to form the tree structure. 

12. A computer capable of operating in a network of similar such computers, the computer comprising: 

25 means identifying computers in the network to which the computer is logically linked in a tree structure; 

means for storing information regarding the load on the computer and at least some of the other computers 
in the network; 

30 means for sending said information to the computers to which the computer is logically linked; 

means for receiving similar information from said computers to wnich the computer is logically linked and 
updating the stored information in accordance therewith; 

3^ means for selecting one of the other computers in the network having spare kaad capacity from said information 

and transferring tasks to the selected computer; 

characterised in that the information stored in the computer contains a number of entries, each entry con- 
taining information regarding: 

40 

(i) the load on one of the computers in the network, 

(ii) the number of links in the tree separating that other computer from the computer, and 

^5 (ill) the one of the computers, which are linked to the computer, from which the entry was last received; 

and wherein, the computer further comprises means, operable when the computer receives the similar in- 
formation from one of the other computers to which it is linked, for: 

^0 (a) incrementing the number of links value in each entry of the received information by one; 

(b) deleting entries in the received information which originated from the other computer; 



55 



(c) deleting entries in the information already stored in the computer which were received from the other com- 
puter; 

(d) merging the received information with the information already stored In the computer; 
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(e) sorting the merged information in ascending order of load, entries with equal load being sorted in ascending 
order of number of iinks separation from the computer 

13. A computer as claimed in claim 12 including: 

means for accessing a stored list of all the computers in the network; 

means for selecting a computer from this list as a candidate neighbouring computer for linking in the tree 
structure; 

means for sending a message to the selected computer indicating a link request; 

means for establishing a logical link with the selected computer by updating the identifying means if a positive 
response from the selected computer is received; 

means from receiving messages from other computers indicating that a link is requested; 

means for determining, on receipt of such a link request message, whether or not there is capacity for estab- 
lishing a downward link and sending a positive response to the sender of the link request message if there is 
such capacity and updating the identifying means accordingly. 

14. A network of computers comprising a plurality of computers as claimed in claim 12 or claim 1 3. 
Patentanspruche 

1. Verfahren zum Betrieb eines Rechners in einem Rechnernetz, wobei das Verfahren beinhallet: 

Generieren logischer Verbindungsglieder zwischen dem Rechner und anderen Rechnern im Netz, so daB sich 
eine Baumstruktur bildet, wobei der Rechner logisch mit einem im Baum hoher stehenden Rechner und mit 
mehreren im Baum tiefer stehenden Rechnern verbunden ist; und gekennzeichnet durch 

Unterhatten von im Rechner gespeicherten Informationen betreffend die augenblickliche Auslastung des 
Rechners und die Auslastung wenigstens einiger der anderen Rechner im Netz durch Bewirken, da3 der 
Rechner periodisch diese Informationen an die Rechner weitergibt, mit denen er logisch verbunden ist, um 
ahnliche solcher Informationen von diesen Rechnern zu erhalten und seine eigenen Informationen enlspre- 
chend zu aktualisieren, so daR die Informationen benutzt werden konnen, um Rechner im Netz zu bestimmen, 
die Extralasten aufnehmen konnen; 

wobei die im Rechner gespeicherten Informationen eine Anzahl Eintrage enthalten, und jeder Eintrag Infor- 
mationen uber folgende Punkte enthalt: 

(i) Die Belastung eines der Rechner im Netz, 

(ii) die Anzahl der Verbindungsglieder im Baum^ die diesen anderen Rechner vom Rechner trennen, und 

(iii) denjenigen der Rechner, die mit dem Rechner verbunden sind, von dem der Eintrag zuletzt erhalten 
wurde; 

und in dem bei Eingang ahnlicher Informationen von einem Rechner, mit dem er verbunden ist, die folgenden 
Schritte ausgefuhrt werden: 

(a) Die Hohe der Anzahl der Verbindungsglieder in jedem Eintrag der erhaltenen Informationen wird um 
eins erhoht; 

(b) Eintrage in den erhaltenen Informationen. die von dem Rechner ausgingen, werden geloscht; 

(c) Eintrage in den im Rechner bereits gespeicherten Informationen, die von dem anderen Rechner ein- 
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gingen, werden geloscht; 

(d) die eingegangenen Informationen werden mit den bereits in Rechner gespeicherten Informationen 
verbunden; 

(e) die verbundenen Infonmationen werden in aufsteigender Reihenfolge der Auslastung sortierl, Eintrage 
gleicher Auslastung werden dabei in aufsteigender Reihenfolge der Verbtndungsglieder, um die er vom 
Rechner getrennt ist, sortiert. 

Ein Vertahren gemalB Anspruch 1 , das den tolgenden Schritt enthalt: Wenn der Rechner uberlastet ist, Benutzen 
der Information, um eine Rechner zu bestimmen, der eine Extralast ubernehmen kann, und Ubertragen wenigstens 
einer Aufgabe auf diesen Rechner. 

Ein Verfahren gemaB Anspruch 1 oder Anspruch 2, das den folgenden Schritt enthalt; Wenn der hoher im Baum 
sitzende Rechner, mit dem der Rechner logisch verbunden ist, ausfallt oder sonstwie nicht betriebsbereit ist, Ge- 
nerieren eines neuen logischen Verbindungsglieds mit einem anderen Rechner im Netz, der noch Kapazitat zum 
Ubernehmen neuer, nach unten gerichteter Verbindungsglleder hat 

Ein Verlahren gemaB einem beliebigen der vorstehenden Anspruche, in dem die periodische Verteilung von Last- 
informationen vom Rechner benutzt wird, um festzustellen. ob die Rechner, mit denen er Im Baum verbunden ist, 
betriebsbereit sind oder nicht, wobei der Rechner, wenn einer der weiter unten im Baum sitzenden Rechner mit 
denen er verbunden ist, nicht betriebsbereit ist, als Kapazitat zur Annahme neuer nach unten gerichteter Verbin- 
dungen aufweisend markiert wird. 

Ein Verfahren gemaB Anspruch 1, in dem vor jeder Verteilung von Informationseintragen in die Informationen 
bezOglich der Rechner, die die gleiche Auslastung und Trennung um die gleiche Anzahl von Verbindungsgtiedern 
von dem Rechner autweisen, nach Zufallsverteilung permutiert werden. 

Ein Verfahren gemaf3 Anspruch 1 oder 5, in dem, wenn sich ein Eintrag auf den Rechner selbst bezieht, dieser 
als erster in seinen eigenen sortierten Informationen erscheint, ein Zahleran den Eintrag angeschlossen und auf 
den negativen Wert der Reserve last kapazitat des Rechners initialisiert wird, wobei der Zahler jedesmal inkremen- 
tiert wird, wenn dieser Eintrag immer noch der erste ist, wenn die Information verteilt wird, und sobald der Zahler 
positiv wird. dieser Eintrag aus der Information entfernt wird, wobei sich die Wahrscheinlichkeit, daB der gleiche 
Eintrag als erster in den in verschiedenen Rechnern gespeicherten Informationen erscheint, reduziert. 

Ein Verfahren gemaB einem beliebigen der vorstehenden Anspruche, in dem die Periode der periodischen Ver- 
teilung der Information an einen hoher im Baum sitzenden Rechner von der Rate abhangt, mit der neue Aufgaben 
im Rechner geschaffen werden. 

Ein Verlahren gemaB Anspruch 7, in dem die Periode der periodischen Verteilung von der Rate abhangt, mit der 
neue Aufgaben in Rechnern geschaffen werden, mit denen der Rechner in der Baumstruktur logisch verbunden ist. 

Ein Verfahren gemaB einem beliebigen der vorstehenden Anspruche, in dem die Verteilung der Informationen 
durch den Rechner an einen weiter unten im Baum sitzenden Rechner als Antwort auf den Eingang der ahnlichen 
Information vom weiter unten im Baum sitzenden Rechner erfolgt. 

10. Ein Verfahren zum Betreiben eines Rechnemetzes, beinhaltend das Betreiben jedes Rechners unter Verwendung 
eines Verfahrens, wie es in beliebigen der vorstehenden Anspruche beansprucht ist. 

50 11. Ein Verfahren gemaB Anspruch 10, in dem das Generieren der logischen Verbindungsglleder erzielt wird durch 
Zuweisung eines Rangs an jeden Rechner, wobei nicht zwel Rechner den gleichen Rang zugeteilt bekommen, 
jeder Rechner logisch mit einem Rechner eines niedrigeren Rangs und mit einer Anzahl Rechner hoheren Rangs 
verbunden ist, um so eine Baumstruktur zu bilden. 

55 12. Ein Rechner, der in der Lage Ist, In einem Netz aus ahnlichen solchen Rechnern zu arbeiten, wobei der Rechner 
beinhaltet; 

Mittel zum Identifizieren von Rechnern im Netz, mit denen der Rechner in einer Baumstruktur logisch verbun- 
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den ist; 

Mittel zum Abspeichern der Information en betreffend die Auslastung des Rechners und wenlgstens eintger 
der anderen Rechner inn Netz; 

Mittel zum Senden der Informationen an die Rechner, mrt denen der Rechner logisch verbunden ist; 

Mittel zum Empfangen ahnllcher Informationen von den Rechnem, mit denen der Rechner logisch verbunden 
isl, und Aktualisieren der abgespeicherten Informationen in Ubereinstimmung damit; 

Mittel zum Anwahlen eines der anderen Rechner mit frelen Auslastungskapazitaten im Netz aus diesen Infor- 
mationen und Ubertragen von Aufgaben auf diesen angewahtten Rechner; 

dadurch gekennzeichnet, daB die im Rechner gespaicherten Informationen eine Anzahl Eintrage enthalten, und 
jeder Eintrag Informationen uber folgende Punkte enthalt: 

(i) Die Belastung eines der Rechner im Netz; 

(ii) die Anzahl der Verbindungsglieder im Baum, die diesen anderen Rechner vom Rechner trennen, und 

(iii) denjenigen der Rechner, die mit dem Rechner verbunden sind, von dem der Eintrag zuletzt erhalten wurde; 

und in dem der Rechner ferner Mittel beinhaltet. die betreibbar sind, wenn der Rechner diese ahnlichen Informa- 
tionen von einem der anderen Rechner erhalt, mit dem er verbunden ist. um die folgenden Ma3nahmen zu ergrei- 
fen: 

(a) Inkrementieren des Werts der Anzahl der Verbindungsglieder in jedem Eintrag der erhaltenen Informatio- 
nen um eins; 

(b) Loschen der Eintrage in den ertialtenen Informationen, die von dem anderen Rechner stammen; 

(c) Loschen der Eintrage in den im Rechner bereits gespeicherten Informationen, die von dem anderen Rech- 
ner eingegangen waren; 

(d) Verbindung der eingegangen Informationen mit den bereits in Rechner gespeicherten Informationen; 

(e) Sortieren der verbundenen Informationen in aufsteigender Reihenfolge der Auslastung, wobei Eintrage 
gleicher Auslastung in aufsteigender Reihenfolge der Anzahl der Verbindungsglieder, um die er vom Rechner 
getrennt ist, sortlert werden. 

13. Ein Rechner gemaQ Anspruch 12, beinhaltend: 

Mittel, um auf eine gespeicherte Liste aller Rechner im Netz zuzugreifen; 

Mittel zum Wahlen eines Rechners aus der Liste als benachbarten Kandidaten-Rechner zum Verbinden in 
der Baumstruktur; 

Mittel zum Senden einer Meldung an den gewahtten Rechner mit Anforderung der Verbindung; 

Mittel zum Errichten einer loglschen Verbindung mit dem gewahlten Rechner durch Aktualisieren der Identi- 
fizierungsmittel, wenn eine positive Antwort vom angewahlten Rechner eingeht; 

Mittel zum Empfangen von Meldungen anderer Rechner, die anzelgen, daQ eine Verbindung angefordert wird; 

Mittel zum Bestimmen bei Eingang einer solchen verbindungsanfordernden Meldung, ob es eine Kapazitat 
zum Aufbau einer nach unten gerichteten Verbindung gibt oder nicht, und Senden einer positlven Meldung an 
den Sender der verbindungsanfordernden Meldung, wenn eine solche Kapazitat vorhanden ist. und entspre- 
chende Aktualisierung der Identifizierungsmittel. 
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14. Ein Rechnernetz, enthattend eine Vielzahl von Rechnern gemaO den Anspruchen 12 Oder 13. 



Revendications 

5 

1. Precede pour exploiter un ordinateur dans un reseau d'ordinateurs, le precede comprenant les phases qui con- 
sistent 

creer des liaisons logiques entre I'ordinateur et d'autres ordinateurs du reseau de mantere a former une struc- 
ture arborescente. Tordinateur etant relie logiquement a un ordinateur situe plus haut dans I'arbre et ^ plusieurs 
ordinateurs sttues plus bas dans I'arbre ; et caract6ris6 par le fait de 

consen/er dans I'ordinateur un enregistrement de rinfornnation relative h la chargeen nnennoire courante de 
I'ordinateur et de la charge en mdmoire d'au moins certains des autres ordinateurs du reseau, en obligeant 
'5 I'ordinateur a repartir periodiquement I'infornnation entre les ordinateurs auxquels il est relie logiquement, pour 

recevoir des dits ordinateurs des infornnations similaires et pour mettre h jour sa propre inlormalion conlor- 
mementace qu'ila regu, de sorte que 1' information peut etre utilisee pour determiner quels sont les ordinateurs 
du reseau qui peuvent accepter une charge supplementaire ; 

ou I'information enregistree dans I'ordinateur comprend de nombreuses entrees, chaque entree contenant 
rinformation relative a : 

(i) la charge en memoire d'un des ordinateurs du reseau. 

(ii) le nombre de liaisons d'arborescence separant cet autre ordinateur de Tordinateur, et 

(iii) celui, parmi les ordinateurs qui sont reli6s ^ I'ordinateur, depuis lequel a 6X6 envoy6e I'entr^e regue 
en dernier; 

et ou, quand I'ordinateur revolt rinformation similaire envoyee par un ordinateur auquet il est relie, les phases 
suivantes sont executees : 

30 

(a) la valeur du nombre de liaisons dans chaque entree de I'information regue est augmentee d'un 
increment ; 

(b) dans information regue les enlrees qui provenaient a I'origine de I'ordinateur sont effacees; 

(c) les entrees de I'information deja enregistree dans I'ordinateur qui ont ete regues en provenance de 
3^ I'autre ordinateur sont effacees ; 

(d) I'information regue est fusionn^e ^ information d6j^ enregistree dans I'ordinateur; 

(e) rinformation fusionnee est triee par ordre croissant d'importance de la charge, !es entrees ayant une 
meme charge etant triees par ordre croissant du nombre de liaisons les separant de I'ordinateur. 

2. Precede selon la revendication 1 , comprenant la phase qui consiste, quand la memoire de I'ordinateur est trop 
chargee, a utiliser I'information pour determiner quel ordinateur peut accepter une charge supplementaire et k 
transferer au moins une tSche vers cet ordinateur. 

3. Procede selon la revendication 1 ou la revendication 2 comprenant, si I'ordinateur, situe plus haut dans I'arbre, 
45 auquel I'ordinateur est relie logiquement est defaillant ou incapable de fonctionner pour une quelconque raison, 

la creation d'une nouvelle liaison togique vers un autre ordinateur du reseau qui peut accepter de nouvelles liaisons 
vers le bas. 

4. Procede selon I'une quelconque des revendications precedentes, ou la repartition periodique de I'information re- 
5^ lative a la charge en memoire est utilisee par I'ordinateur pour determiner si les ordinateurs auxquels il est relie 

dans I'arborescence sont operationnels ou pas, I'ordinateur etant signale comme pouvanl accepter de nouvelles 
liaisons vers le bas si I'un des ordinateurs auxquels 11 est relie et qui se trouve plus bas dans I'arborescence, n'est 
pas op6rationnel. 

55 5. Procede selon la revendication 1 ou, avant chaque repartition de I'information, les entrees relatives aux ordinateurs 
qui ont la mdme charge et le m6me nombre de liaisons de separation avec I'ordinateur sont permut^es au hasard. 

6. Procede selon la revendication 1 ou 5 ou, si une entree correspondant a I'ordinateur apparait en premier dans sa 
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propre information triee, un compteur est associe ^ I'entree et initialise avec la valeur negative de la capacite de 
charge restante de rordinateur, le connpteur augmentant d'un incr6nnent, si cette entr6e est toujours la premidre, 
quand 1' information est repartie et, si le compteur devient posttif cette entree est retiree de Tinformation. grace k 
quoi la probabilite pour que la meme entree apparaisse en premier dans I'information enregistree dans diflerents 
ordlnateurs se trouve r6duite. 

7. Precede selon Tune quelconque des revendications precedentes, ou la periodicite de la repartition periodique de 
I'information vers un ordinateur plac6 plus haul dans rarborescence depend de la vltesse ^ laquelle de nouveltes 
taches sont creees dans I'ordinateur. 

8. Precede selon la revendicalion 9 ou la periodicite de la repartition p6riodique depend de la vitesse k laquelle de 
nouvelles taches sont creees dans les ordinateurs auxquels i'ordinateur est relie logiquement dans la structure 
arborescente. 

9. Precede selon I'une quelconque des revendications precedentes ou la repartition de I'information vers un ordina- 
teur plac6 plus bas dans I'arbre est ex6cut6e en r6ponse k la r6ception, par Terdinateur, de rinformation similaire 
emanant de I'erdlnateur place plus bas dans I'arbre. 

10. Proc6d6 pour exploiter un r6seau d'ordinateurs, ouchaque ordinateur est exploit6en utilisantunproc6d6conforme 
a I'une quelconque des revendications precedentes. 

11. Proc666 selon la revendication 10, ou on obtient la creation des liaisons logiques en asslgnant un rang k chaque 
ordinateur, deux ordinateurs ne se voyant pas assigner le meme rang, chaque ordinateur etant relie a un ordinateur 
de rang inferieur et a plusieurs ordinateurs de rang superieur pour former la structure arborescente. 

12. Ordinateur pouvant etre exploite dans un reseau compose d'erdinateurs similaires, I'ordinateur comprenant : 

un moyen identifiant les ordinateurs du reseau auxquels I'ordinateur est relie logiquement dans une structure 
arborescente ; 

un moyen pour enregistrer I'information relative a la charge en memoire de I'ordinateur et acelle d'au moins 
certains des autres ordinateurs du reseau ; 

un moyen envoyant ladite information aux ordinateurs auxquels I'ordinateur est relie logiquement; 

un moyen pour recevoir I'intormation similaire envoyde par les dits ordinateurs auxquels I'ordinateur est reli6 
logiquement et pour mettre a jour rinformation enregistree, conlormement a ce qui a ete regu ; 

un moyen pour s6lectionner, k partir de ladite information, un des autres ordinateurs du r6seau auquel il reste 
de la capacite de charge et pour transferer des taches a I'ordinateur selecttonne ; 

caracterise en ce que I'information enregistree dans I'ordinateur comprend plusieurs entrees, chaque entree con- 
tenant de I'information relative a : 

(i) la charge en memoire d'un des ordinateurs du reseau, 

(ii) le nombre de liaisons, dans I'arbre, separant cet autre ordinateur de I'ordinateur et 

(iii) parmiceux qui sont relies a I'ordinateur, I'ordinateur de puis lequel I'entree regue en dernier a ete envoyee ; 

et ou I'ordinateur comprend en outre un moyen qui, quand I'ordinateur re^it I'information similaire envoyee par 
un ordinateur auquel il est relie. sert a : 

(a) ajouter un increment k la valeur du nombre de liaisons dans chaque entree de I'information re^ue ; 

(b) effacer dans I'information re9ue les entries qui provenaient k I'origine de I'autre ordinateur; 

(c) effacer les entrees de rinformation deja enregsitree dans I'ordinateur qui ont ete refues d'un autre 
ordinateur ; 

(d) fusionner I'information refue avec I'information 66\k enregistree dans I'ordinateur; 

(e) trier par ordre croissant d'importance de la charge {'information fusionnee, les entrees ayant une meme 
charge etant c!ass6es par ordre croissant du nombre de liaisons les separant de I'ordinateur 
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Ordinateur selon la revendication 12 comprenanl : 

un moyen pour acceder^ renregistrement d'une liste de tous les ordinateurs du reseau ; 

un moyen pour s6lectionner, ^ partir de cette liste, un ordinateur voisin comme candidat possible pour la 
creation d'une liaison dans la structure arborescente ; 

un moyen pour envoyer ^ I'ordinateur s6lectlonn6 un message indiquant une requite de liaison ; 

un moyen pour etablir une liaison loglque avec I'ordinateur selectionne en mettant ^ jour le'moyen d'identifi- 
cation si une reponse positive a 6t6 re9ue de I'drdlnateur selectionn6 ; 

un moyen pour recevoir les messages des autres ordinateurs indiquant qu'il y a une requete pour une liaison ; 

un moyen pour determiner, sur reception d'un message de requete de liaison, s'il est possible ou non d'etablir 
une liaison vers le bas et pour envoyer une r6ponse positive & I'^metteur du message de requite de liaison 
si c'est possible et pour mettre k jour en consequence le moyen d' identification. 

R6seau d'ordinateur comprenant une plurality d'ordinateurs tels que revendtqu6s dans la revendication 1 2 ou 13. 
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